Windows Server 2008: Recovering from a Disaster - When Disasters Strike

11/19/2010 2:46:28 PM

A disaster recovery plan is most valuable when a failure or disaster actually strikes, and at that point following the plan matters as much as having it. A procedure or checklist keeps all involved parties on the same page and makes clear which steps are being taken to rectify the situation. The following sections detail steps that help ensure that no time is wasted and that staff are not sent in the wrong direction.

Qualifying the Disaster or Failure

When a system failure occurs or is reported, the information can come from a number of different sources and should be verified. The reported issue might actually be caused by user or operator error, a network connectivity problem, or a problem with a specific user account's configuration or status. Verify a reported failure by repeating the same steps that the reporting party performed.

If the system is, in fact, in a failed state, the impact of the failure should be noted, and this information should be escalated within the organization so that a formal recovery plan can be created. This is known as qualifying the disaster or failure. Qualifying a failure includes recording a short description of the failure, the steps used to validate the failure, who is affected, how many end users are affected, which dependent applications or systems are affected, which branch offices are affected, and who is responsible for the maintenance and recovery of the system.
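The items that make up a qualification can be captured in a small record so that nothing is skipped when the failure is escalated. The following is only a minimal sketch in Python; the class name, field names, and sample values (including the server name FS01) are hypothetical and are not part of any Windows Server tool.

```python
from dataclasses import dataclass, field


@dataclass
class FailureQualification:
    """One qualified failure, using the items listed in the text above."""
    description: str               # short description of the failure
    validation_steps: list[str]    # steps used to validate the failure
    affected_groups: list[str]     # who is affected
    affected_user_count: int       # how many end users are affected
    dependent_systems: list[str] = field(default_factory=list)
    branch_offices: list[str] = field(default_factory=list)
    owner: str = "unassigned"      # who maintains and recovers the system

    def summary(self) -> str:
        """Return a one-line summary suitable for an escalation message."""
        return (
            f"{self.description} | users affected: {self.affected_user_count} "
            f"| dependent systems: {', '.join(self.dependent_systems) or 'none'} "
            f"| owner: {self.owner}"
        )


# Hypothetical example values, for illustration only.
record = FailureQualification(
    description="File server FS01 does not respond",
    validation_steps=["Ping FS01 from the help desk", "Open the shared folder"],
    affected_groups=["Accounting"],
    affected_user_count=120,
    dependent_systems=["Payroll application"],
    branch_offices=["Main office"],
    owner="Server team",
)
print(record.summary())
```

Keeping the record in a fixed structure also leaves a written trail that can be referred to in the postmortem meeting.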

Validating Priorities

When a disaster strikes that affects an entire server room or office location, the priority of restoring systems and operations should already be determined. First and foremost are the core infrastructure systems, such as networking and power, followed by authentication systems and then the remaining bare minimum core services. In the event of a failure that involves multiple systems (for example, the failure of a web server that supports 10 separate applications), the priority of recovery should be presented to and approved by management. If each of these 10 applications takes 30 minutes to recover, it could be 5 hours before the system is fully functional; if one particular application is critical to business operations, that application should be recovered first. Perform regular checkpoints and verification to ensure that the recovery work being performed is in line with the priorities of the organization.
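The arithmetic behind the web server example can be checked with a short script. This is a rough sketch that assumes the applications are restored one at a time and that each takes the same 30 minutes; the application names and priority numbers are hypothetical.

```python
def recovery_timeline(apps, minutes_each=30):
    """Return (name, minutes_until_restored) pairs, most critical first.

    apps is a list of (name, priority) pairs; a lower priority number
    means the application is more critical to the business.
    """
    ordered = sorted(apps, key=lambda app: app[1])
    timeline = []
    elapsed = 0
    for name, _priority in ordered:
        elapsed += minutes_each  # applications are restored sequentially
        timeline.append((name, elapsed))
    return timeline


# Ten hypothetical applications; "order-entry" is the business-critical one.
apps = [(f"app-{i}", 5) for i in range(1, 10)] + [("order-entry", 1)]
timeline = recovery_timeline(apps)

print(timeline[0])   # ('order-entry', 30): the critical application is back first
print(timeline[-1])  # ('app-9', 300): 300 minutes, or 5 hours, for everything
```

Restoring the critical application first does not shorten the 5 hours needed for the whole server, but it cuts the wait for that one application from as much as 300 minutes to 30.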

Assume and Be Doomed

Disasters, system failures, and data corruption issues tend to create a lot of stress and havoc among technical and business personnel. Recovery administrators and managers should always be on the same page regarding the priority of recovery and the process. Get these decisions in writing, on paper or in electronic format, because they might be needed later to justify why a choice was made. Administrators who move forward on resolving an issue based on assumptions, without first communicating with their managers, might find themselves in a very sticky situation, especially if their actions prove unsuccessful or end up causing more problems.

Synchronizing with Business Owners

Prioritizing the recovery of critical and bare minimum business systems is part of disaster recovery planning. When a situation arises that requires an entire data center or group of systems to be restored or recovered, the steps that will be followed need to be put in front of the business owners again. Remember that between the time a disaster recovery plan is created and the time the failure occurs, business priorities might have shifted, and the business owners might be the only ones aware of this change. During a recovery situation, always take the time to stay calm and focused, and communicate with the managers, executives, and business owners so that they are informed of the progress. Informed business owners are less likely to stay in the server room or data center if they feel that recovery efforts are in good hands.

Communicating with Vendors and Staff

When failures or disasters strike, communication is key. Regardless of whether customers, vendors, employees, or executives are affected, some level of communication is required or suggested. This is where the soft skills of an experienced manager, sales executive, technical consultant, and possibly even a lawyer can be most valuable. Technical staff frequently make the mistake of providing too much information, information that is too technical, or, worst of all, incorrect information or no information at all. My recommendation to technical staff is to communicate only with your direct manager or, if your manager is not available, with his or her boss. If the CEO or an end user asks for an update, try to defer to the manager as best you can, so that focus can be kept on restoring services.

Assigning Tasks and Scheduling Resources

At this point, we have a failure, we have an approved plan, we have communicated the situation, and we are ready to begin fixing the issue. The next step is to delegate the specific tasks to qualified staff members for execution. As stated previously, hand off communication to a manager or spokesperson and communicate only through that person if possible. Determining who will restore a particular system is as important as, if not more important than, assigning communication responsibilities. Only certain technical staff members might be qualified to restore a given system, so selecting the correct resource is essential.

When a serious failure has occurred, recovery efforts might require multiple technical resources onsite for an extended period of time. Furthermore, there might be dependencies that affect which systems can be restored first, and, of course, the order or priority of restore will advance or delay the recovery of any given system. Mapping out the extended recovery timeline and the technical resource schedule ensures that a technical resource is not called onsite until their skills and time are required. Also, rotating technical resources after six to eight hours helps to keep progress moving forward.
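One way to reason about how dependencies and priorities interact is to treat the systems as a dependency graph and always restore the most important system whose prerequisites are already back online. The sketch below is only an illustration of that idea using a priority-aware topological sort; the system names, priority numbers, and dependencies are hypothetical.

```python
import heapq


def restore_order(priorities, depends_on):
    """Return system names in an order that respects dependencies.

    priorities maps each system to a number (lower = restore sooner).
    depends_on maps a system to the systems that must be restored first.
    Among the systems that are ready, the lowest priority number wins.
    """
    # Count unmet dependencies per system and record who waits on whom.
    remaining = {name: 0 for name in priorities}
    dependents = {name: [] for name in priorities}
    for name, deps in depends_on.items():
        for dep in deps:
            remaining[name] += 1
            dependents[dep].append(name)

    # Start with every system that has no unmet dependencies.
    ready = [(priorities[n], n) for n, count in remaining.items() if count == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        for child in dependents[name]:
            remaining[child] -= 1
            if remaining[child] == 0:
                heapq.heappush(ready, (priorities[child], child))

    if len(order) != len(priorities):
        raise ValueError("circular dependency among systems")
    return order


# Hypothetical systems: the payroll application outranks the file server,
# but it cannot be restored until the file server is available.
priorities = {"power": 1, "network": 2, "authentication": 3,
              "payroll app": 4, "file server": 5}
depends_on = {
    "network": ["power"],
    "authentication": ["network"],
    "file server": ["authentication"],
    "payroll app": ["authentication", "file server"],
}
print(restore_order(priorities, depends_on))
```

In this example the file server is restored before the payroll application despite its lower priority, which is the kind of dependency-driven delay that the recovery timeline and staff schedule need to account for.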

Keeping the Troops Happy

This section goes out to all technical leads, project managers, IT managers, business owners, and executives. If you have technical resources working for you in an effort to recover from a failure, you should do all you can to ensure that these technical resources are kept happy and focused. For starters, try to keep the end users and any other business owners or executives from bothering this staff. Regular communication will help with this task tremendously. Next, and possibly more important, provide all the bottled water, soda, coffee, snacks, food, breaks, and anything else that will keep these professionals happy, healthy, and focused on the task at hand. Technical staff will work very hard during disaster situations, so don’t forget to pat them on the back and let them know how much the organization and you personally appreciate their time and commitment.

Recovering the Infrastructure

After the failure has been validated, the initial communications meetings have been held, restore tasks have been confirmed and possibly reprioritized, and recovery task assignment of resources has been completed, the recovery efforts can finally begin. Verify that each technical resource has all the documentation, phone numbers, software, and hardware they require to perform their task. Hold periodic checkpoint meetings, starting every 15 minutes and tapering off to every 30 or 60 minutes as recovery efforts continue.

Postmortem Meeting

After a system failure or disaster has occurred and the recovery has been completed, an organization should hold a meeting to review the entire process. The meeting might just be an event where individuals are recognized for their great work; however, it will most likely involve reviewing what went wrong and identifying how the process could be improved in the future. A lot of interesting things happen during disaster recovery situations, both unplanned and simulated, and this meeting can provide the catalyst for ongoing improvement of the processes and documentation.